HDFS Architecture

Apache HDFS, the Hadoop Distributed File System, is a block-structured file system in which each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or more machines. The Apache Hadoop HDFS architecture follows a master/slave design, where a cluster comprises a single NameNode (the master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Although several DataNodes can run on a single machine, in practice these DataNodes are spread across separate machines.
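The block size is configurable. As a minimal sketch, assuming Hadoop 2.x or later (where the default is 128 MB), it can be overridden with the dfs.blocksize property in hdfs-site.xml:

<property>
 <name>dfs.blocksize</name>
 <value>134217728</value> <!-- 128 MB; illustrative value, tune per workload -->
</property>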


HDFS Architecture Diagram

NameNode
The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). It is the single master server that manages the file system namespace and regulates access to files by clients. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on DataNodes only, and the NameNode itself never reads or writes file data. When a client requests to read a file, the NameNode returns the locations of the file's blocks and the client reads them directly from the DataNodes. When a client creates a new file, the NameNode allocates a set of DataNodes for each block (ordered by proximity to the client) and the client writes the data directly to those DataNodes.

Functions of NameNode
  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It maintains the HDFS namespace metadata, which includes file names, directory names, file and directory permissions, the file-to-block mapping, block IDs, and block locations.
  • The metadata is kept in RAM for fast access.
  • There are two on-disk files associated with the metadata:
  • FsImage: a snapshot of the complete file system namespace as of the last checkpoint.
  • EditLogs: a record of all modifications made to the file system since the most recent FsImage.
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog.
  • It regularly receives a Heartbeat and a block report from every DataNode in the cluster to verify that the DataNodes are live.
  • It keeps a record of all the blocks in HDFS and the nodes on which these blocks are located.
  • It is also responsible for maintaining the replication factor of all the blocks (see the property sketch after this list).
  • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
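The replication factor referenced above, along with the local directories in which the NameNode persists its FsImage and EditLogs, is typically configured in hdfs-site.xml. A minimal sketch with illustrative values (the directory paths are examples only):

<property>
 <name>dfs.replication</name>
 <value>3</value> <!-- default replication factor for new blocks -->
</property>

<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///data/1/dfs/nn,file:///data/2/dfs/nn</value> <!-- example paths where FsImage and EditLogs are stored -->
</property>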
In CDH5, the NameNode namespace URI is configured via the fs.defaultFS property in core-site.xml.

<property>
 <name>fs.defaultFS</name>
 <value>hdfs://<namenode host>:<namenode port>/</value>
</property>

Another important property that must be configured is dfs.permissions.superusergroup in hdfs-site.xml. It sets the UNIX group whose members are the HDFS superusers.

<property>
 <name>dfs.permissions.superusergroup</name>
 <value>hadoop</value>
</property>


DataNode
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that does not need to be of high quality or high availability. A DataNode is a block server that stores the data on the node's local file system (for example, ext3 or ext4).
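The local directories in which a DataNode stores block data are set with dfs.datanode.data.dir in hdfs-site.xml. A minimal sketch, using example mount points (the paths are illustrative, not prescriptive):

<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn</value> <!-- example mount points on the local ext3/ext4 file systems -->
</property>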

Functions of DataNode:
  • These are the slave daemons; one runs on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes serve the low-level read and write requests from the file system's clients.
  • They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is 3 seconds (see the sketch below).
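The heartbeat interval itself is configurable through hdfs-site.xml. A minimal sketch showing the default of 3 seconds:

<property>
 <name>dfs.heartbeat.interval</name>
 <value>3</value> <!-- seconds between DataNode heartbeats; 3 is the default -->
</property>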
Difference between NameNode and DataNode
The NameNode manages the file system namespace and maintains the metadata information; it therefore requires a large amount of RAM, enough to hold the metadata for millions of HDFS files in memory. A DataNode, on the other hand, needs a large disk capacity for storing huge data sets.
Secondary NameNode
The Secondary NameNode works alongside the primary NameNode as a helper daemon. Despite its name, it is not a backup NameNode.

Secondary NameNode Checkpointing Diagram

Functions of Secondary NameNode
  • The Secondary NameNode periodically reads the file system metadata held in the NameNode's RAM and writes it to the hard disk or the file system.
  • It is responsible for combining the EditLogs with the FsImage from the NameNode.
  • It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
  • The new FsImage is copied back to the NameNode and is used the next time the NameNode starts. Hence, the Secondary NameNode performs regular checkpoints in HDFS and is therefore also called the CheckpointNode (the checkpoint schedule is sketched below).
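The checkpoint schedule is controlled by two hdfs-site.xml properties. A minimal sketch with their usual defaults (a checkpoint every hour, or after one million uncheckpointed edit-log transactions, whichever comes first):

<property>
 <name>dfs.namenode.checkpoint.period</name>
 <value>3600</value> <!-- seconds between checkpoints (default: 1 hour) -->
</property>

<property>
 <name>dfs.namenode.checkpoint.txns</name>
 <value>1000000</value> <!-- force a checkpoint after this many edit-log transactions -->
</property>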
How does the NameNode handle DataNode failures in Hadoop HDFS?
HDFS has a master/slave architecture in which the master is the NameNode and the slaves are the DataNodes. An HDFS cluster has a single NameNode that manages the file system namespace (metadata) and multiple DataNodes that store the actual data and perform read/write operations on behalf of clients. Each DataNode sends a Heartbeat and a block report to the NameNode: receipt of a heartbeat implies that the DataNode is functioning properly, and a block report lists all the blocks stored on that DataNode. By default, a DataNode sends a heartbeat every 3 seconds. When the NameNode stops receiving heartbeats from a DataNode, it assumes that the node is dead or non-functional. As soon as the DataNode is declared dead, all the blocks it hosted are re-replicated from the remaining replicas onto other DataNodes, restoring the configured replication factor. This is how the NameNode handles DataNode failures.
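How long the NameNode waits before declaring a silent DataNode dead is derived from two settings, roughly 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, which with the default values (300,000 ms and 3 s) comes to about 10 minutes 30 seconds. A minimal sketch of the recheck interval in hdfs-site.xml:

<property>
 <name>dfs.namenode.heartbeat.recheck-interval</name>
 <value>300000</value> <!-- milliseconds; 5 minutes is the default -->
</property>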